Learning
= General =

The typical supervised learning problem is: given pairs (X_1, Y_1) , ..., (X_n, Y_n) , where the X are called ''input'' (or ''explanatory'') variables and the Y are called ''output'' (or ''dependent'') variables, and novel examples of input variables X^\prime_1 , ..., X^\prime_m , find \hat Y^\prime_1 , ..., \hat Y^\prime_m which resemble the real Y^\prime_1 , ..., Y^\prime_m as closely as possible. Each pair (X_i, Y_i) is called an ''observation''. "As closely as possible" often means minimizing the RMSE (root mean square error), but other metrics are used as well. The set of pairs (X_i, Y_i) is called the ''training set''. The set of novel examples X^\prime_1 , ..., X^\prime_m for which the Y's are to be determined is called the ''test set''. The error on the test set is called the ''generalization error''.

The problem cannot be solved in the general case, but for some classes of joint distribution \rho(X,Y) optimal or suboptimal algorithms exist, and this turns out to be useful for wide classes of datasets.

In the simplest case, X_i is a vector of ''features'', and all vectors have the same length. Each feature can be quantitative (e.g. car weight), qualitative (e.g. car type), or ordered categorical (e.g. "small", "medium", or "large"). The dependent variable(s) can be quantitative or qualitative as well. Sometimes similar problems need to be solved, such as estimating the distribution of Y given a novel X.

Some learning algorithms depend on parameters, such as the gradient descent rate, which cannot be determined from the training set directly. Such parameters are called ''metaparameters''.

When Y is either 0 or 1 for all observations, there are additional measures (a computational sketch of these measures follows the Abbreviations list below):
* ''Sensitivity'' is the probability of a correct prediction when Y=1 (the true positive rate).
* ''Specificity'' is the probability of a correct prediction when Y=0 (the true negative rate).
* ''ROC curve'' (Receiver Operating Characteristic) is the curve of the true positive rate (sensitivity) against the false positive rate (1 − specificity).
* ''c-statistic'' is the area under the ROC curve.

== Abbreviations ==

* PRSS - Penalized Residual Sum of Squares.
* RSS - Residual Sum of Squares.
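As a concrete illustration of the binary measures above, here is a minimal sketch (Python with NumPy; the function name and the 0.5 threshold are our own choices, not part of any standard API) that computes sensitivity, specificity and the c-statistic from true labels and predicted scores.

<syntaxhighlight lang="python">
import numpy as np

def binary_metrics(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity and c-statistic (area under the ROC curve).

    Assumes both classes are present in y_true.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    y_pred = (y_score >= threshold).astype(int)

    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    sensitivity = tp / np.sum(y_true == 1)   # P(correct prediction | Y = 1)
    specificity = tn / np.sum(y_true == 0)   # P(correct prediction | Y = 0)

    # c-statistic: probability that a random positive is scored above a random negative
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    c = np.mean(pos[:, None] > neg[None, :]) + 0.5 * np.mean(pos[:, None] == neg[None, :])
    return sensitivity, specificity, c
</syntaxhighlight>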
= Bias and variance =

Suppose we have:
* a joint distribution \rho(X,Y) ,
* some input point X_0 ,
* a number of observations N,
* some learning algorithm A.

An infinite number of times we perform the following experiment:
* Create a training set T_i by randomly and independently selecting N observations from the joint distribution. (Here i is the number of the experiment.)
* Randomly select Y_i(X_0) from the conditional distribution \rho^\prime(Y|X=X_0) = \rho(X_0,Y) .
* Train the algorithm A on the training set T_i and obtain \hat Y_i(X_0) .

We want to find the error (Y_i-\hat Y_i)^2 , averaged over all i. In order to find it, we first describe Y_i as the sum of its mathematical expectation and noise:
: Y_i = f(X_0) + \epsilon_i ,
where f(X) is the mathematical expectation of Y(X), while \epsilon_i is noise. Similarly, we can describe \hat Y_i as
: \hat Y_i = E(\hat Y(X_0)) + \Delta Y_i ,
where E(...) means mathematical expectation, that is, averaging over all the experiments. By the definition of mathematical expectation, E(\Delta Y)=0 . Now we have:
: E((\hat Y-Y)^2) = E\left(\left(E(\hat Y(X_0))-f(X_0) - \Delta Y + \epsilon\right)^2\right)
Since \epsilon is independent of the other terms and its mathematical expectation is 0, this equals
: E\left(\left(E(\hat Y(X_0))-f(X_0) - \Delta Y\right)^2\right) + E(\epsilon^2)
But E(\hat Y(X_0))-f(X_0) is independent of i. Let us call it B .
Then the first term is
: E((B - \Delta Y)^2) = B^2 - 2BE(\Delta Y) + E((\Delta Y)^2)
Since E(\Delta Y)=0 , we finally obtain:
: E((\hat Y-Y)^2) = B^2 + E((\Delta Y)^2) + E(\epsilon^2)
Here B is called the ''bias'', E((\Delta Y)^2) is the ''variance'', and \epsilon is the ''noise''. Generally, the more complicated the function \hat Y(X) (i.e. the more parameters it has), the lower the bias and the higher the variance. As for the noise, it is irreducible and independent of the algorithm or the number of observations.

= Selecting metaparameters =

To select the number of parameters, two information criteria can be used:
* ''Akaike Information Criterion'' (AIC): -2\ln L + 2k
* ''Bayesian Information Criterion'' (BIC): -2\ln L + k \ln n
Here L is the likelihood, k is the number of free parameters, and n is the number of observations in the training set. According to Wikipedia's article [[wikipedia:Akaike information criterion|Akaike information criterion]], AIC seems to be better than BIC. Other approaches:
* ''Vapnik–Chervonenkis (VC) dimension'': the maximal number of points (in some configuration) which the model can always separate correctly, no matter how they are labeled. For VC dimension h and N observations, define \rho=h/N ; then the effective error is \overline{\mathrm{err}}\left(1-\sqrt{\rho-\rho\log\rho+{\log N\over N}}\right)^{-1}_+ .
* Cross-validation (a sketch is given below).
* Bootstrap: 0.368\,\overline{\mathrm{err}}_{\mathrm{in}}+0.632\,\overline{\mathrm{err}}_{\mathrm{out}}
* Parametric bootstrap: add Gaussian noise to the target variables.
AIC, cross-validation and the bootstrap produce models of similar quality, close to the best available. BIC is slightly worse than the bootstrap, and VC is worse than BIC. Cross-validation and the bootstrap provide close estimates of the test error; AIC does not (it is too optimistic).
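The following is a minimal sketch of K-fold cross-validation for metaparameter selection (Python/NumPy; <code>fit</code>, <code>predict</code> and the candidate grid are placeholder assumptions for whatever learning algorithm is being tuned, not a fixed API).

<syntaxhighlight lang="python">
import numpy as np

def kfold_cv_rmse(X, y, fit, predict, K=10, seed=0):
    """Estimate the test RMSE of a learning algorithm by K-fold cross-validation.

    fit(X_train, y_train) -> model;  predict(model, X_test) -> y_hat
    (both are placeholders supplied by the caller).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])
        y_hat = predict(model, X[test])
        errors.append(np.mean((y[test] - y_hat) ** 2))   # per-fold MSE
    return np.sqrt(np.mean(errors))

def select_metaparameter(X, y, candidates, make_fit, predict):
    """Pick the candidate metaparameter with the smallest cross-validated RMSE."""
    scores = {lam: kfold_cv_rmse(X, y, make_fit(lam), predict) for lam in candidates}
    return min(scores, key=scores.get), scores
</syntaxhighlight>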
= Function approximation =

A common approach is to approximate Y as
: Y=f_\theta(X)
where f_\theta is a parametrized family of functions. Learning then amounts to finding the most suitable \theta .

== Minimizing residual sum of squares ==

If each \theta has the same prior probability, and the real model is Y=f_{\theta^\prime}(X)+\epsilon for some \theta^\prime , where \epsilon is Gaussian noise, then the maximum likelihood is attained by the \theta minimizing
: \mathrm{RSS}=\sum_i(f_\theta(X_i)-Y_i)^2
Here RSS means Residual Sum of Squares.

== Minimizing penalized residual sum of squares ==

Often the prior probability differs between different \theta , and then we use the ''Penalized Residual Sum of Squares'' (''PRSS''). For example, since functions with a large second derivative are improbable, the following PRSS is used for ''smoothing splines'':
: \mathrm{PRSS}=\sum_i(f_\theta(X_i)-Y_i)^2 + \lambda\int (f''(x))^2 dx
Here \lambda is a metaparameter of the learning.

== Bayesian methods ==

Given a set of observations Z, the posterior probability is:
: P(\theta|Z) = \frac{P(Z|\theta)P(\theta)}{\int P(Z|\theta^\prime)P(\theta^\prime)d\theta^\prime}
The probability for a new observation z is
: P(z|Z) = \int P(z|\theta) P(\theta|Z) d\theta

== Expectation Maximization ==

The idea of Expectation Maximization (EM) is to introduce latent variables. At each iteration, estimate the latent data based on the model parameters (expectation step), then the model parameters based on the latent data (maximization step). Repeat till convergence.

Example: multicomponent mixture model \rho(x) = \sum_{k=1}^K N(x; \mu_k, \sigma_k) (a code sketch follows below):
: Expectation step: determine the probability p_{ik} that observation i belongs to component k:
:: \hat p_{ik} \leftarrow N(x_i; \hat\mu_k, \hat\sigma_k) / \sum_m N(x_i; \hat\mu_m, \hat\sigma_m)
: Maximization step: given the p's, determine the \mu 's and \sigma 's:
:: \hat\mu_k \leftarrow \sum_i \hat p_{ik} x_i / \sum_i \hat p_{ik}
:: \hat\sigma^2_k \leftarrow \sum_i \hat p_{ik} \left(x_i-\hat\mu_k\right)^2 / \sum_i \hat p_{ik}
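Here is a minimal sketch of the EM iteration for the mixture example above (Python/NumPy; one-dimensional data and equal mixing weights, as in the formula; the function name, the fixed iteration count and the random initialization are our own choices).

<syntaxhighlight lang="python">
import numpy as np

def em_gaussian_mixture(x, K=2, n_iter=100, seed=0):
    """EM for a 1-D mixture of K Gaussians with equal mixing weights."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    mu = rng.choice(x, size=K, replace=False)      # initial means: K distinct data points
    sigma = np.full(K, x.std() + 1e-12)            # initial standard deviations

    for _ in range(n_iter):
        # E-step: responsibilities p_ik proportional to N(x_i; mu_k, sigma_k)
        dens = (np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma[None, :]) ** 2)
                / (np.sqrt(2.0 * np.pi) * sigma[None, :]))
        p = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)

        # M-step: weighted means and standard deviations
        w = p.sum(axis=0)
        mu = (p * x[:, None]).sum(axis=0) / w
        sigma = np.sqrt((p * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / w)
    return mu, sigma

# Example usage with two synthetic components:
# x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 2, 300)])
# mu, sigma = em_gaussian_mixture(x, K=2)
</syntaxhighlight>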
= Bagging and stacking =

Make bootstrap samples and then let \hat f(x) = {1 \over B} \sum_{i=1}^{B} \hat f_i(x) , where \hat f_i(x) is the prediction of the model trained on the i-th bootstrap sample. This is useful for trees, for example. For classification, \hat f_i(x) should also be a vector of class probabilities, not an indicator vector (0,\; ...,\; 0,\; 1,\; 0,\; ...,\; 0) . If, instead of averaging, we choose the best model, this approach is called ''bumping''. A linear combination of different models, or some other combination of different models, is called ''stacking''.

= Linear model =

Let all features be numerical, and let Y be a scalar. The linear model assumes a linear dependence of Y on X:
: \hat Y=\beta X + \beta_0
Adding the new feature X_0=1 , we have
: \hat Y=\beta X
Then
: \textrm{RSS}(\beta) = \sum_i\left(y_i-\beta x_i\right)^2
In matrix form:
: \mathrm{RSS}(\beta) = \left(\textbf{y}-\textbf{X}\beta\right)^T \left(\textbf{y}-\textbf{X}\beta\right)
Minimizing it, we obtain the best fit:
: \hat \beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
Here \mathbf{X}^T \mathbf{X} is called the ''covariance matrix''.

== Distribution of the estimated coefficients ==

If
: y_i = \beta\mathbf{x}_i + \epsilon_i ,
where
: \epsilon \sim N(0, \sigma)
and the noise for different observations is independent, we can substitute this into the best-fit formula and obtain:
: E(\hat\beta)=\beta
: \operatorname{Var}(\hat\beta) = (\mathbf{X}^T \mathbf{X})^{-1}\sigma^2
The typical estimate of \sigma^2 is:
: \hat\sigma^2 = {1 \over N-p-1} \sum_i (y_i - \hat y_i)^2
Here p is the number of predictors (dimensions), and this formula has the attractive feature that
: E(\hat\sigma^2) = \sigma^2
Since \hat\beta is a linear function of (\epsilon_1,...,\epsilon_n) , which is normal and centered at zero, the distribution of \hat\beta must also be normal. But we already know its mean and variance, therefore \hat\beta\sim N(\beta, (\mathbf{X}^T \mathbf{X})^{-1}\sigma^2) .

== Null hypothesis ==

In order to test whether the i-th coefficient is significantly non-zero, we calculate
: z_i = {\hat\beta_i \over \hat\sigma\sqrt{\left((\mathbf{X}^T \mathbf{X})^{-1}\right)_{ii}}}
This number is called the ''Z-score''. If the i-th coefficient is zero, z_i \sim N(0,1) . (A code sketch of the least-squares fit and the Z-scores follows the next subsection.)

== Gauss–Markov theorem ==

The least-squares linear model has the least variance among all linear unbiased estimators. Indeed, the mean squared error is the sum of the squared bias, the variance, and the mean squared noise. Consider the least-squares estimator and any other unbiased linear estimator: their bias is the same (zero), and their mean squared noise is also the same (it is the same for all models). Hence, if the other estimator has a higher mean squared error, it also has a higher variance.
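A minimal sketch of the closed-form least-squares fit, the noise estimate and the Z-scores above (Python/NumPy; it assumes X already contains the constant column X_0 = 1, and the function name is ours).

<syntaxhighlight lang="python">
import numpy as np

def ols_with_zscores(X, y):
    """Least-squares fit, unbiased noise estimate and Z-scores for the linear model."""
    X = np.asarray(X, dtype=float)      # N x (p+1); first column assumed to be all ones
    y = np.asarray(y, dtype=float)
    N, p_plus_1 = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y        # beta_hat = (X^T X)^{-1} X^T y

    residuals = y - X @ beta_hat
    sigma2_hat = residuals @ residuals / (N - p_plus_1)     # sigma^2 estimate, 1/(N-p-1) * RSS

    z = beta_hat / np.sqrt(np.diag(XtX_inv) * sigma2_hat)   # Z-scores
    return beta_hat, sigma2_hat, z
</syntaxhighlight>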
== Regularization ==

Dropping or penalizing the variables whose statistical significance is poor may improve the generalization error, for two reasons:
* The true coefficient may actually be zero (or very small compared with its estimate).
* The estimate may be far away from the true coefficient.
There are two kinds of regularization: subset selection (every variable is either retained or discarded) and shrinkage (some variables are penalized, but not discarded). For all regularization methods there is a metaparameter (the number of variables retained, the overall penalty multiplier, etc.). It can be determined via a training/validation split, cross-validation, or the AIC/BIC criteria.

=== Subset selection ===

The advantage of subset selection is its usefulness for interpreting the results.
* Best-subset selection: check all possible subsets, and choose the best one for each size. The size is our metaparameter. Disadvantage: it needs 2^p linear regressions (unless p>N, in which case we need "only" \binom{p}{N} linear regressions). Note, however, that the covariance matrix needs to be calculated only once; we only need to take its submatrices.
* Forward stepwise: sequentially add the predictor which most improves the fit. A QR decomposition of the current N×q submatrix of X (q is the current number of predictors) can be used to find the next predictor rapidly. This may be better than best-subset selection in terms of generalization error, and it is certainly faster.
* Backward stepwise: start with the complete set of predictors, and sequentially drop the predictor with the smallest absolute Z-score. Advantage: easier to implement than forward stepwise. Disadvantage: does not work for p>N. (For p=N, the Z-scores are undefined, but we can try to delete each variable and check which deletion retains the minimal residual error.) Another disadvantage: if p is large and the number of variables to retain is small, it can be more computationally intensive than forward stepwise.
* Hybrid approaches: at each iteration, do either a forward or a backward step, depending on AIC, for example.
* Forward stagewise: at each step, select the variable most correlated with the residuals, run a regression using this variable as the only input and the residuals as the output, and add the resulting coefficient for this variable. Repeat until all correlations with the residuals are (almost) zero. This method is slow (since it can examine the same variable many times), but sometimes provides better generalization.
* Incremental forward stagewise: as above, but increase the predictor's coefficient by \pm\epsilon only.

=== Shrinkage methods ===

Generally, these are smoother than subset selection methods and may lead to less variance. (A ridge-regression sketch follows this list.)
* Ridge regression: apply the penalty \lambda\lVert\hat\beta\rVert^2 . This changes the formula to \hat\beta=\left(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}\right)^{-1}\mathbf{X}^T\mathbf{y} . This is the same as multiplying the SVD components of X by \frac{d_i^2}{d_i^2+\lambda} , where d_i is the i-th diagonal element of the diagonal matrix obtained in the SVD ( d_i^2 is N times the variance of the i-th component). This method does not delete any variables.
* Lasso: apply the penalty \lambda\sum_i|\hat\beta_i| . Some of the coefficients may become exactly zero, and for a large enough \lambda all of them do. Therefore, this method deletes variables, but does so smoothly, since each coefficient is a continuous function of \lambda .
** Can be computed using the LARS algorithm.
** Or start with a high \lambda , decrease \lambda slightly at each step, and cycle through the variables until convergence.
* Combination of the above: apply both penalties (with different \lambda 's). One can also apply a penalty to all non-zero terms (best-subset).
* Least angle regression: similar to forward stepwise. The coefficients are initially zero. Find the most correlated variable, add it to the active set, and increase its coefficient until some competitor is as correlated with the residuals as it is. Then increase the coefficients of both (in their joint least-squares direction on the residuals) until there is another variable as correlated with the residuals as these two; then increase the coefficients of all three, and so on.
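A minimal sketch of ridge regression via the closed form above (Python/NumPy; the intercept is conventionally excluded from the penalty, so here we simply assume centred inputs with no intercept column; the function name is ours).

<syntaxhighlight lang="python">
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: beta = (X^T X + lam * I)^{-1} X^T y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The metaparameter lam is typically chosen by cross-validation
# (see the cross-validation sketch earlier in these notes).
</syntaxhighlight>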
=== Intermediate ===

* Principal Components Regression: retain only the first several principal components, discard the others.
* Partial least squares: at the m-th iteration, do the following:
: \mathbf{z}_m = \sum_i \langle\mathbf{x}_i^{m-1}, \mathbf{y}\rangle \mathbf{x}_i^{m-1}
: run the linear regression from \mathbf{z}_m to '''y''' and add it to the prediction;
: orthogonalize the x's with respect to \mathbf{z}_m .

= Basis functions =

f may be approximated as a linear combination of ''basis functions'' h_1(X) , ..., h_M(X) :
: f_\theta(X) = \sum_{m=1}^M \theta_m h_m(X)
Sometimes the basis functions are known in advance and we only need to learn the coefficients \theta_1 , ..., \theta_M . In other cases, they are also parameter-dependent. Such methods are called ''dictionary methods'', which means that we have a (possibly infinite) "dictionary" of basis functions and search it for the functions we need. Examples:
* Splines
* Radial basis functions - multidimensional kernels located at centroids; Gaussian kernels are popular
* Single-layer feed-forward neural nets, h_m(X) = \sigma(a_m X+b_m) , \sigma(t)=1/(1+e^{-t})

== Splines ==

For splines of order M,
: h_m(x) = \begin{cases} x^{m-1}, & m\le M \\ (x-x_{m-M})_+^{M-1}, & M+1 \le m \le M+K \end{cases}
where a_+ = \max(a,0) . The points x_1 , ..., x_K are given externally and are called ''knots''. Such splines have M-2 continuous derivatives. Often M=4 is used:
: h_m(x)= \begin{cases} x^{m-1}, & m\le 4 \\ (x-x_{m-4})_+^3, & 5 \le m \le K+4 \end{cases}
Such splines are called ''cubic splines''. Splines of order 2 are called ''linear splines''.

== Natural splines ==

''Natural splines'' are cubic splines which are linear outside [x_1, x_K] . To construct the component functions, notice that (x-x_k)_+^3 - (x-x_K)_+^3 is a quadratic function on [x_K, \infty) whose second derivative is 6(x_K-x_k) . Therefore, if
: d_k(x) = \frac{(x-x_k)_+^3 - (x-x_K)_+^3}{x_K-x_k} ,
then d_k''=6 on [x_K, \infty) , and
: d_{k+1}(x)-d_k(x)
is linear on [x_K, \infty) . Since d_K is undefined (0/0), we have the K-2 functions d_2-d_1 , ..., d_{K-1}-d_{K-2} . We also have the 2 functions f(x)=1 and f(x)=x. In total, we have K functions instead of K+4 for cubic splines. The 4 missing functions correspond to 4 constraints: the quadratic and cubic terms must be 0 both for x<x_1 and for x>x_K .

== Smoothing splines ==

Apply the penalty \lambda\int \left(f^{\prime\prime}(x)\right)^2dx , where \lambda is a metaparameter.

== Multidimensional splines ==

The basis functions are h_{ij}(\mathbf{x}) = h_i^{(1)}(x_1) h_j^{(2)}(x_2) in the 2-dimensional case, and similar for higher dimensions.

== Wavelet functions ==

These reward sparseness rather than smoothness. The basis is:
: \psi_{m,n}(t)=2^{-m/2}\psi(2^{-m}t-nb) .
Sometimes another number instead of 2 can be used. Here \psi(x)=\phi(2x)-\phi(2x-1) , where \phi is the scaling function (''father wavelet'') and \psi is the ''mother wavelet''. If the penalty is 2\lambda|\boldsymbol{\theta}| , then
: \theta_i = \operatorname{sign}(y^*_i)\left(|y^*_i|-\lambda\right)_+
where y^*_i is the i-th coefficient before the penalty, \mathbf{y}^*=\mathbf{W}^\mathrm{T}\mathbf{y} . A simple choice is \lambda=\sigma\sqrt{2\log n} , where \sigma is the estimated standard deviation of the noise in the y's. \mathbf{y}^* can be calculated in linear time using pyramid schemes.

== Radial basis functions and mixture model ==

''Radial basis functions'' (RBF) are:
: h_i(\mathbf{x}) = D(||\mathbf{x}-\mathbf{c}_i||/r_i)
If all the r's are the same, this may cause holes. To avoid holes, one can use ''renormalized radial basis functions'':
: h_i(\mathbf{x}) = {D(||\mathbf{x}-\mathbf{c}_i||/r)\over \sum_j D(||\mathbf{x}-\mathbf{c}_j||/r)}
If, instead of D, we use arbitrary Gaussian kernels (with arbitrary covariance matrices), this is called a ''mixture model''.

= Local models =

== kNN ==

kNN (or KNN) means "k Nearest Neighbours". The prediction for Y is the average of Y over several neighbouring points:
: \hat Y(x) = {1\over k} \sum_{i=1}^k Y(x_i(x))
where x_i(x) is the i-th closest point to x, and k is a metaparameter.

== Curse of dimensionality ==

For a fixed number of observations n and an increasing number of features (dimension), the distance even to the nearest neighbour soon becomes comparable to the distance to an average point, and then becomes increasingly close to it. Therefore, when there are enough features, the closest neighbours are not much closer than the other points, and the kNN model does not work. It fails at O(\log {n \over k}) features. Another way to formulate the same problem: if we increase the dimensionality and want the k nearest neighbours to be at a distance of no more than some fixed r (compared with the average distance to the points), the number of observations we need grows exponentially. Another problem with dimensionality: the higher the dimensionality, the closer an average point is to the boundary of the region. Not only is the region where the neighbours of X reside big, X itself will typically lie near its edge.

== Kernel methods ==

Kernel methods minimize a ''weighted'' RSS, which weights points close to the given point more than distant points:
: To obtain \hat Y(X) :
:: minimize \sum_i K(X_i,X)(Y_i-f_\theta(X_i))^2 with respect to \theta ;
:: set \hat Y(X) = f_\theta(X) .
Here K is called the kernel. The Gaussian kernel is popular. If f_\theta=\mathrm{const} , we obtain the Nadaraya–Watson weighted average:
: \hat Y(X) = \frac{\sum_i K(X, X_i) Y_i}{\sum_i K(X, X_i)}
If f_\theta is a linear function, we have the ''local linear regression'' model. Kernel methods are also prone to the curse of dimensionality. (A Nadaraya–Watson sketch follows the list of kernels below.)

== Kernels and regressions ==

* ''Epanechnikov kernel'': {3 \over 4}(1-t^2) if |t|<1 , otherwise 0.
* ''Tri-cube kernel'': (1-|t|^3)^3 if |t|<1 , otherwise 0.
* ''Local polynomial regression'': fit polynomials.
* f(X) = \alpha(Z) + \beta(Z)^\mathrm{T}(X_1, ..., X_m) , where Z=(X_{m+1},...,X_p) .
* See also [[#ANOVA]].
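A minimal sketch of the Nadaraya–Watson weighted average with the Epanechnikov kernel (Python/NumPy, one-dimensional inputs; the bandwidth argument h and the function names are our own choices).

<syntaxhighlight lang="python">
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel: 3/4 (1 - t^2) for |t| < 1, else 0."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) < 1, 0.75 * (1 - t ** 2), 0.0)

def nadaraya_watson(x0, x, y, h=1.0):
    """Kernel-weighted average of y at the query point x0, with bandwidth h.

    Assumes at least one training point lies within the bandwidth of x0.
    """
    w = epanechnikov((np.asarray(x, dtype=float) - x0) / h)
    return np.sum(w * np.asarray(y, dtype=float)) / np.sum(w)
</syntaxhighlight>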
= Additive and similar models =

== Generalized additive model ==

For the ''generalized additive model'' (GAM), the target function is a sum of one-dimensional functions:
: f(x) = \alpha + \sum_i f_i(x_i)\!
Algorithm: estimate the f_i one by one in a loop. Each time, fit the residual \mathbf{y}-\sum_{j\ne i} \hat f_j(\mathbf{x})\! as a smoothing spline and subtract its average.

== Additive logistic regression ==

: \log{\Pr(Y=1|x)\over\Pr(Y=0|x)} = \alpha + \sum_i f_i(x_i)
Algorithm. Let L be the logistic function, L(z)=1/(1+\exp(-z)) . At each iteration, for each observation i, 1 \le i \le n , set
: s_i = \alpha + \sum_j f_j(x_{ij})
: \hat p_i = L(s_i)
: t_i = s_i + (y_i - \hat p_i) / L^\prime(s_i) (target)
: w_i = L^\prime(s_i) (weight)
Then find an additive model for the targets '''t''' with weights '''w'''. (Note that L^\prime(s_i) = \hat p_i(1-\hat p_i) .)

== ANOVA ==

* ''ANOVA'' (analysis of variance): \hat Y = a + \sum_i f_i(x_i) + \sum_i\sum_{j<i} f_{ij}(x_i, x_j) + \dots

= Decision trees and similar =

A decision tree is a binary tree whose leaves contain forecasts for Y, and every other node contains a predicate depending on the explanatory variables. Its forecast equals: the forecast of its root, if the root is a leaf; the forecast of its left subtree, if the predicate in the root node is satisfied; the forecast of its right subtree otherwise.

== Classification And Regression Trees (CART) ==

Below we consider those decision trees whose non-leaf nodes contain only predicates of the type X_k\le a . Here X_k is called the ''splitting variable''. We have a set S of n observations with p explanatory variables and want to transform it into such a decision tree.

== Regression tree ==

A regression tree is a decision tree whose leaves predict the mathematical expectation of Y. Its creation consists of two phases: ''growing'' and ''shrinking''.

=== Growing ===

If the number of observations is above some threshold (often 5), find the pair (k, a), 1\le k\le p, a\in R , such that for the dichotomy (S_1=\{i|X_{ik}\le a\},\; S_2=\{i|X_{ik}>a\})
: |S_1|\operatorname{Var}(\{Y_i|i\in S_1\}) + |S_2|\operatorname{Var}(\{Y_i|i\in S_2\})
is minimal. Make (X_k\le a) the top of the regression tree and recursively grow regression trees from S_1 and S_2 . If the number of observations is not above the threshold, just create a leaf whose forecast is the average of the targets of all the observations in S. (A code sketch of this split search is given below, after the CART modifications.)

=== Shrinking ===

At each iteration, replace a subtree with a single leaf. Select the subtree for which the increase in RSS per removed node is minimal. This is called ''weakest link pruning''. Continue until the tree shrinks to a single leaf. The resulting sequence of trees minimizes \text{RSS}+\alpha|T| for various \alpha . Find the best \alpha by 5-fold or 10-fold cross-validation.

== Classification trees ==

Instead of RSS, use one of the following criteria for growing and shrinking:
: 1-\hat p_{mk(m)}\! (misclassification error)
: \sum_k \hat p_{mk}(1-\hat p_{mk})\! (Gini index)
: -\sum_k \hat p_{mk} \log \hat p_{mk}\! (cross-entropy or deviance)

== Modifications and issues of CART ==

=== Unordered categorical predictors ===

There are 2^{q-1}-1 possible splits for q categories. However, if the output is 0 or 1 (a classification tree with 2 classes), we can sort the categories by the fraction of 1's and treat the predictor as ordered categorical.

=== Missing variables ===

Approaches:
* Discard observations with missing variables.
* Impute missing values (e.g. the mean over the non-missing values).
* For categorical variables: treat "missing" as a new category.
* Surrogate variables: find a variable and a split which best mimics the original split. If this variable can also be missing, find a second surrogate variable, etc.

=== Ternary and other multiway splits ===

Not recommended: they deplete the set of observations too fast.

=== Linear combination splits ===

Use predicates \sum_i a_i X_i \le b instead of X_i \le a . This hurts interpretability but improves the predictive power. HME (Hierarchical Mixture of Experts) is considered to be better.

=== Disadvantages of CART ===

* High variance.
* Lack of smoothness (the function is not continuous).
* Difficulty in capturing additive structure.
Bagging addresses the first one; MARS addresses the other two.
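A minimal sketch of the split search used in the growing phase (Python/NumPy; an exhaustive search over all variables and observed split points, scoring each candidate by |S_1| Var(S_1) + |S_2| Var(S_2); the function name is ours, and no recursion or stopping rule is included).

<syntaxhighlight lang="python">
import numpy as np

def best_split(X, y):
    """Find (k, a) minimizing |S1|*Var(y[S1]) + |S2|*Var(y[S2]) over splits X[:, k] <= a."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    best = (None, None, np.inf)              # (variable index, threshold, score)
    for k in range(X.shape[1]):
        for a in np.unique(X[:, k])[:-1]:    # exclude the maximum so both sides are non-empty
            left = X[:, k] <= a
            right = ~left
            score = left.sum() * y[left].var() + right.sum() * y[right].var()
            if score < best[2]:
                best = (k, a, score)
    return best
</syntaxhighlight>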
== PRIM ==

The ''patient rule induction method'' (PRIM) is designed for regression and can also be used for two-class categorical data. The algorithm is the following:
* Start with a rectangular box containing all points.
* Shrink the box by moving one of its sides towards the center (e.g. [x_1,x_2]\times[y_1,y_2] \rightarrow [x_1,x^\prime_2]\times[y_1,y_2] ) so as to maximize the average within the box. Repeat until the average is maximized, or the number of observations in the box reaches some predefined value.
* Expand the box similarly, to increase the average.
* Choose the optimal box size with cross-validation.
* Remove the points found inside the box at this step and repeat the process, until we get as many boxes as desired.

= Classification =

The output G belongs to one of K classes. We are to determine the most probable class. In this section we present models used explicitly for classification.

== Discriminant analysis ==

Let f_k(X) be the density of points belonging to class k (the ''class-conditional density'') and \pi_k be the prior probability of class k. Then the probability that some point X belongs to class k is:
: P_k(X) = {f_k(X)\pi_k\over \sum_i f_i(X)\pi_i}
Consider the distributions to be Gaussian:
: f_k(X) = {1 \over (2\pi)^{p/2}|\Sigma_k|^{1/2}} \exp\left(-{1\over 2}(X-\mu_k)^T\Sigma_k^{-1}(X-\mu_k)\right)
If \Sigma is the same for all groups, the model is called '''linear discriminant analysis''' (LDA), otherwise it is called '''quadratic discriminant analysis''' (QDA). The borders between classes are linear for the former and quadratic for the latter, hence the names.

== Regularized discriminant analysis ==

The formula
: \hat\Sigma_k(\alpha) = \alpha\hat\Sigma_k + (1-\alpha)\hat\Sigma
(the first term from QDA, the second term from LDA) is called '''regularized discriminant analysis'''. Here \alpha is a metaparameter.

== Space transformations ==

For LDA, we can linearly transform the space so that \Sigma becomes the identity matrix. Then we can project all points onto the subspace spanned by the centroids. We can further decompose that subspace into successively optimal subspaces in terms of centroid separation, by minimizing {a^T W a \over a^T B a} with respect to a, where W is the within-class variance and B is the between-class variance.

== Utilizing points with unknown class ==

Note that the class-conditional densities should satisfy the constraint P(X)=\sum_k \pi_k f_k(X) . With this constraint, even points with unknown class can be used to improve the prediction for these models.

== Logistic regression ==

: P_k(X) = {\exp(\beta_k X) \over \sum_l \exp(\beta_l X)} , with \beta_K=0
The \beta 's are found with the Newton–Raphson algorithm. We can drop the least significant variables, or, better, those which contribute least to the log-likelihood. We can also penalize the \beta 's in a way similar to the lasso model, subtracting \lambda\sum_i|\beta_i| from the log-likelihood.

== Separating plane ==

We have a set (\mathbf{x}_1, y_1) , ..., (\mathbf{x}_n, y_n) , y_i\in \{-1,1\} . We are to find weights \beta such that y_i\,\beta^\mathrm{T}\mathbf{x}_i > 0 for all i.

=== Rosenblatt's Perceptron Learning Algorithm ===

For each '''misclassified''' observation, the weights are updated as
: \beta \leftarrow \beta + \rho \mathbf{x}_i y_i
It can be proved that this procedure converges to a solution, if one exists. When the weights are updated after each observation, this is called ''stochastic gradient descent''. (A short sketch follows.)
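A minimal sketch of the perceptron update rule (Python/NumPy; labels in {-1, +1}, learning rate rho, and a fixed maximum number of passes over the data as a simple stopping rule; the function name is ours).

<syntaxhighlight lang="python">
import numpy as np

def perceptron(X, y, rho=1.0, n_epochs=100):
    """Rosenblatt's perceptron: beta <- beta + rho * y_i * x_i for misclassified points."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)           # labels must be -1 or +1
    beta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        misclassified = False
        for xi, yi in zip(X, y):
            if yi * (beta @ xi) <= 0:        # wrong side of (or on) the hyperplane
                beta += rho * yi * xi
                misclassified = True
        if not misclassified:                # all points classified correctly
            break
    return beta
</syntaxhighlight>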
=== Optimized Separating Hyperplanes ===

We are to maximize the distance to the closest point from either class:
: \max_{\beta,M} M \;\; \mathrm{ s.t. }\;\; \Vert\beta\Vert^2=1,\;\; y_i\mathbf{x}_i^\mathrm{T}\beta \ge M,\;\;i=1,...,n
We can get rid of the \Vert\beta\Vert^2=1 constraint:
: \max_{\beta,M} M\;\; \mathrm{ s.t. }\;\; y_i\mathbf{x}_i^\mathrm{T}\beta \ge M\Vert\beta\Vert,\;\;i=1,...,n
Indeed, since \Vert\beta\Vert=1 , we can replace M by M\Vert\beta\Vert . After this, since multiplying \beta by any positive constant does not affect the inequality, we can drop the constraint \Vert\beta\Vert=1 . Since the inequality is now scale-invariant, we can set \Vert\beta\Vert=\frac{1}{M} :
: \min_\beta{1\over 2}\Vert\beta\Vert^2\;\; \mathrm{ s.t. }\;\; y_i\mathbf{x}_i^\mathrm{T}\beta \ge 1
Solving this with Lagrange multipliers, we obtain the dual form (''Wolfe dual''):
: \max_\alpha L_D = \sum_i \alpha_i - {1\over 2}\sum_{ij}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^\mathrm{T} \mathbf{x}_j
: subject to \alpha_i\ge 0 for all i
This is a convex problem which can be solved. The points for which \alpha_i>0 are called ''support points''.

= Comments =